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Abstract. This paper presents a unified treatment of Gaussian process 
models that extends to data from the exponential dispersion family 
and to survival data. Our specific interest is in the analysis of data 
sets with predictors that have an a priori unknown form of possibly 
nonlinear associations to the response. The modeling approach we de- 
scribe incorporates Gaussian processes in a generalized linear model 
framework to obtain a class of nonparametric regression models where 
the covariance matrix depends on the predictors. We consider, in par- 
ticular, continuous, categorical and count responses. We also look into 
models that account for survival outcomes. We explore alternative co- 
variance formulations for the Gaussian process prior and demonstrate 
the flexibility of the construction. Next, we focus on the important 
problem of selecting variables from the set of possible predictors and 
describe a general framework that employs mixture priors. We com- 
pare alternative MCMC strategies for posterior inference and achieve 
a computationally efficient and practical approach. We demonstrate 
performances on simulated and benchmark data sets. 

Key words and phrases: Bayesian variable selection, generalized linear 
models, Gaussian processes, latent variables, MCMC, nonparametric 
regression, survival data. 
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1. INTRODUCTION 

In this paper we present a unified modeling ap- 
proach to Gaussian processes (GP) that extends to 
data from the exponential dispersion family and to 
survival data. With the advent of kernel-based meth- 
ods, models utilizing Gaussian processes have be- 
come very common in machine learning approaches 
to regression and classification problems; see Ras- 
mussen and Williams (2006). In the statistical lit- 
erature GP regression models have been used as 
a nonparametric approach to model the nonlinear 
relationship between a response variable and a set of 
predictors; see, for example, O'Hagan (1978). Sacks, 
Schiller and Welch (1989) employed a stationary GP 
function of spatial locations in a regression model to 
account for residual spatial variation. Diggle, Tawn 
and Moyeed (1998) extended this construction to 
model the link function of the generalized linear 
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model (GLM) construction of McCullagh and Nelder 
(1989). Neal (1999) considered linear regression and 
logit models. 

We follow up on the literature cited above and 
introduce Gaussian process models class that 
broadens the generalized linear construction by in- 
corporating fairly complex continuous response sur- 
faces. The key idea of the construction is to in- 
troduce latent variables on which a Gaussian pro- 
cess prior is imposed. In the general case the GP 
construction replaces the linear relationship in the 
link function of a GLM. This results in a class of 
nonparametric regression models that can accom- 
modate linear and nonlinear terms, as well as noise 
terms that account for unexplained sources of varia- 
tion in the data. The approach extends to latent re- 
gression models used for continuous, categorical and 
count data. Here we also consider a class of models 
that account for survival outcomes. We explore al- 
ternative covariance formulations for the GP prior 
and demonstrate the flexibility of the construction. 
In addition, we address practical computational is- 
sues that arise in the application of Gaussian pro- 
cesses due to numerical instability in the calculation 
of the covariance matrix. 

Next, we look at the important problem of select- 
ing variables from a set of possible predictors and 
describe a general framework that employs mixture 
priors. Bayesian variable selection has been a topic 
of much attention among researchers over the last 
few years. When a large number of predictors is 
available the inclusion of noninformative variables 
in the analysis may degrade the prediction results. 
Bayesian variable selection methods that use mix- 
ture priors were investigated for the linear regression 
model by George and McCulloch (1993, 1997), with 
contributions by various other authors on special 
features of the selection priors and on computational 
aspects of the method; see Chipman, George and 
McCulloch (2001) for a nice review. Extensions to 
linear regression models with multivariate responses 
were put forward by Brown, Vannucci and Fearn 
(1998b) and to multinomial probit by Sha et al. 
(2004). Early approaches to Bayesian variable se- 
lection for generalized linear models can be found in 
Chen, Ibrahim and Yiannoutsos (1999) and Raftery, 
Madigan and Volinsky (1996). Survival models were 
considered by Volinsky et al. (1997) and, more re- 
cently, by Lee and Mallick (2004) and Sha, Tadesse 
and Vannucci (2006). As for Gaussian process mod- 
els, Linkletter et al. (2006) investigated Bayesian 



variable selection methods in the linear regression 
framework by employing mixture priors with a spike 
at zero on the parameters of the covariance matrix 
of the Guassian process prior. 

Our unified treatment of Gaussian process models 
extends the line of work of Linkletter et al. (2006) 
to more complex data structures and models. We 
transform the covariance parameters and explore 
designs and MCMC strategies that aim at produc- 
ing a minimally correlated parameter space and ef- 
ficiently convergent sampling schemes. In particu- 
lar, we find that Metropolis-within-Gibbs schemes 
achieve a substantial improvement in computational 
efficiency. Our results on simulated data and bench- 
mark data sets show that GP models can lead to im- 
proved predictions without the requirement of pre- 
specifying higher order and nonlinear additive func- 
tions of the predictors. We show, in particular, that 
a Gaussian process covariance matrix with a single 
exponential term is able to map a mixture of linear 
and nonlinear associations with excellent prediction 
performance. 

GP models can be considered part of the broad 
class of nonparametric regression models of the type 
y = /(x) + error, with y an observed (or latent) re- 
sponse, / an unknown function and x a p-dimensio- 
nal vector of covariates, and where the objective is 
to estimate the function / for prediction of future 
responses. Among possible alternative choices to GP 
models, one famous class is that of kernel regression 
models, where the estimate of / is selected from the 
set of functions contained in the reproducing ker- 
nel Hilbert space (RKHS) induced by a chosen ker- 
nel. Kernel models have a long and successful his- 
tory in statistics and machine learning [see Parzen 
(1963), Wahba (1990) and Shawe-Taylor and Cris- 
tianini (2004)] and include many of the most widely 
used statistical methods for nonparametric estima- 
tion, including spline models and methods that use 
regularized techniques. Gaussian processes can be 
constructed with kernel convolutions and, therefore, 
GP models can be seen as contained in the class 
of nonparametric kernel regression with exponen- 
tial family observations. Rasmussen and Williams 
(2006), in particular, note that the GP construction 
is equivalent to a linear basis regression employing 
an infinite set of Gaussian basis functions and results 
in a response surface that lies within the space of 
all mathematically smooth, that is, infinitely mean 
square differentiable, functions spanning the RKHS. 
Constructions of Bayesian kernel methods in the 
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context of GP models can be found in Bishop (2006) 
and Rasmussen and Williams (2006). 

Another popular class of nonparametric spline re- 
gression models is the generalized additive models 
(GAM) of Ruppert, Wand and Carroll (2003), that 
employ linear projections of the unknown function / 
onto a set of basis functions, typically cubic splines 
or B-splines, and related extensions, such as the 
structured additive regression (STAR) models of 
Fahrmeir, Kneib and Lang (2004) that, in addition, 
include interaction surfaces, spatial effects and ran- 
dom effects. Generally speaking, these regression mo- 
dels impose additional structure on the predictors 
and are therefore better suited for the purpose of 
interpretability, while Gaussian process models are 
better suited for prediction. Extensions of STAR 
models also enable variable selection based on spike 
and slab type priors; see, for example, Panagiotelis 
and Smith (2008). 

Ensamble learning models, such as bagging, boost- 
ing and random forest models, utilize decision trees 
as basis functions; see Hastie, Tibshirani and Fried- 
man (2001). Trees readily model interactions and 
nonlinearity subject to a maximum tree depth con- 
straint to prevent overfitting. Generalized boosting 
models (GBMs), as an example, such as the Ada- 
Boost of Freund and Schapire (1997), represent a non- 
linear function of the covariates by simpler basis func- 
tions typically estimated in a stage-wise, iterative 
fashion that successively adds the basis functions to 
fit generalized or pseudo residuals obtained by mini- 
mizing a chosen loss function. GBMs accommodate 
dichotomous, continuous, event time and count res- 
ponses. These models would be expected to produce 
similar prediction results to GP regression and clas- 
sification models. We explore their behavior on one 
of the benchmark data sets in the application section 
of this paper. Notice that GBMs do not incorporate 
an explicit variable selection mechanism that allows 
to exclude nuisance covariates, as we do with GP 
models, although they do provide a relative measure 
of variable importance, averaged over all trees. 

Regression trees partition the predictor space and 
fit independent models in different parts of the input 
space, therefore facilitating nonstationarity and lead- 
ing to smaller local covariance matrices. "Treed GP" 
models are constructed by Gramacy and Lee (2008) 
and extend the constant and linear construction of 
Chipman, George and McCulloch (2002). A prior is 
specified over the tree process, and posterior infer- 
ence is performed on the joint tree and leaf models. 



The effect of this formulation is to allow the corre- 
lation structure to vary over the input space. Since 
each tree region is composed of a portion of the ob- 
servations, there is a computational savings to gen- 
erate the GP covariance matrix from m r < n obser- 
vations for region r. The authors note that treed GP 
models are best suited ". . .towards problems with a 

smaller number of distinct partitions " So, while 

it is theoretically possible to perform variable selec- 
tion in a forward selection manner, in applications 
these models are often used with single covariates. 

The rest of the paper is organized as follows: In 
Section 2 we formally introduce the class of GP mod- 
els by broadening the generalized linear construc- 
tion. We also extend this class to include models for 
survival data. Possible constructions of the GP co- 
variance matrix are enumerated in Section 3. Prior 
distributions for variable selection are discussed in 
Section 4 and posterior inference, including MCMC 
algorithms and prediction strategies, in Section 5. 
We include simulated data illustrations for continu- 
ous, count and survival data regression in Section 6, 
followed by benchmark applications in Section 7. 
Concluding remarks and suggestions for future re- 
search are in Section 8. Some details on computa- 
tional issues and related pseudo-code are given in 
the Appendix. 

2. GAUSSIAN PROCESS MODELS 

We introduce Gaussian process models via a uni- 
fied modeling approach that extends to data from 
the exponential dispersion family and to survival 
data. 

2.1 Generalized Models 

In a generalized linear model the monotone link 
function g(-) relates the linear predictors to the cano- 
nical parameter as g{r]i) = x-/3, with rji the canonical 
parameter for the ith observation, Xj = (xi, ■ ■ ■ , x p )' 
a p x 1 column vector of predictors for the ith sub- 
ject and j3 the coefficient vector (3= . . . , /3 P )'. 
A broader class of models that incorporate fairly 
complex continuous response surfaces is obtained 
by introducing latent variables on which a Gaus- 
sian process prior is imposed. More specifically, the 
latent variables z(x.i) define the values of the link 
function as 

(!) g{m) = z(xi), i = l,...,n, 
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and a Gaussian process (GP) prior on the nxl 
latent vector is specified as 

(2) z(X) = (*(xi), . . . , z(x n ))' ~ N(0, C), 

with the nxn covariance matrix C a fairly complex 
function of the predictors. This class of models can 
be cast within the model-based geostatistics frame- 
work of Diggle, Tawn and Moyeed (1998), with the 
dimension of the space being equal to the number 
of covariates. 

The class of models introduced above extends to 
latent regression models used for continuous, cate- 
gorical and count data. We provide some details on 
models for continuous and binary responses and for 
count data, since we will be using these cases in our 
simulation studies presented below. GP regression 
models are obtained by choosing the link function 
in (1) as the identity function, that is, 

(3) y = z(X) + e, 

with y the nxl observed response vector, z(X) 
an ?i-dimensional realization from a GP as in (2), 
and e rsj Af(0, ~I n ) with r a precision parameter. 
A Gamma prior can be imposed on r, that is, r ~ 
Q{a r ,b r ). Linear models of type (3) were studied by 
Neal (1999) and Linkletter et al. (2006). One notices 
that, by integrating z(X) out, the marginalized like- 
lihood is 
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that is, a regression model with the covariance ma- 
trix of the response depending on the predictors. Non- 
linear response surfaces can be generated as a func- 
tion of those covariates for suitable choices of the co- 
variance matrix. We discuss some of the most popu- 
lar in Section 3. 

In the case of a binary response, class labels ti € 
{0, 1} for i = 1, ... ,n are observed. We assume U ~ 
Binomial(l;pj) and define pi = P{ti = l|z(Xj)) with 
z(X) as in (2). For logit models, for example, we 
have p, L = F(z(xj)) = 1/[1 + exp(— z(xj))]. Similarly, 
for binary probit we can directly define the inverse 
link function as pi = <£(z(xj)), with <£(•) the cdf of 
standard normal distribution. However, a more com- 
mon approach to inference in probit models uses 
data augmentation; see Albert and Chib (1993). This 
approach defines latent values yi which are related 
to the response via a regression model, that is, in 
our latent GP framework, j/j = z(xj) + £j, with ~ 
jV(0, 1), and associated to the observed classes, ti, 



via the rule ti = 1 if yi > and ti = if yi < 0. No- 
tice that the latent variable approach results in a GP 
on y with a covariance function obtained by adding 
a "jitter" of variance one to C, with a similar effect 
of the noise component in the regression models (3) 
and (4). Neal (1999) argues that an effect close to 
a probit model can be produced by a logit model by 
introducing a large amount of jitter in its covariance 
matrix. Extensions to multivariate models for con- 
tinuous and categorical responses are quite straight- 
forward. 

As another example, count data models can be 
obtained by choosing the canonical link function for 
the Poisson distribution as log (A) = z(X) with z(X) 
as in (2). Over-dispersion, possibly caused from lack 
of inclusion of all possible predictors, is taken into 
account by modeling the extra variability via ran- 
dom effects, Ui, that is, Aj = exp(z(xj) + ui) = 
exp(z(xj)) exp(uj) = Aj<5j. For identifiability, one can 
impose E(<5j) = 1 and marginalize over 5i using a con- 
jugate prior, Si ~ Q(r, r), to achieve the negative bi- 
nomial likelihood as in Long (1997), 

n(si\Xi,r) 

(5) 

r( Si + i)r(r) Vt + aJ Vt + aJ ' 

for Si £ N U {0}, with the same mean as the Poisson 
regression model, that is, E(sj) = Aj, and Var(sj) = 
Aj + A?/t, with the added parameter r capturing the 
variance inflation associated with over-dispersion. 

2.2 Survival Data 

The modeling approach via Gaussian processes ex- 
ploited above extends to other classes of models, for 
example, those for survival data. In survival studies 
the task is typically to measure the effect of a set of 
variables on the survival time, that is, the time to 
a particular event or "failure" of interest, such as 
death or occurrence of a disease. The Cox proportio- 
nal hazard model of Cox (1972) is an extremely pop- 
ular choice. The model is defined through the hazard 
rate function h(t\xi) = ho(t) exp(x^/3), where ho(-) 
is the baseline hazard function, t is the failure time 
and (3 the p-dimensional regression coefficient vec- 
tor. The cumulative baseline hazard function is de- 
noted as Ho(t) = J Q ho(u)du and the survivor func- 
tion becomes S(i|x;) = S (t) exp(x * /3) , where S (t) = 
exp{— H()(t)} is the baseline survivor function. 

Let us indicate the data as (ti,xi, di), . . . , (t n ,,x n ,, 
d n ) with censoring index di = if the observation 
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is right censored and di = 1 if the failure time t{ is 
observed. A GP model for survival data is defined as 

(6) h(ti\z(xi)) = h (ti)exp(z(xi)), i = 1,2, . . . ,n, 

with z(X) as in (2). In this general setting, defining 
a probability model for Bayesian analysis requires 
the identification of a prior formulation for the cu- 
mulative baseline hazard function. One strategy of- 
ten adopted in the literature on survival models is 
to utilize the partial likelihood of Cox (1972) that 
avoids prior specification and estimation of the base- 
line hazard, achieving a parsimonious representa- 
tion of the model. Alternatively, Kalbfleisch (1978) 
employs a nonparametric gamma process prior on 
Ho(ti) and then calculates a marginalized likelihood. 
This "full" likelihood formulation tends to behave 
similarly to the partial likelihood one when the con- 
centration parameter of the gamma process prior 
tends to 0, placing no confidence in the initial para- 
metric guess. Sinha, Ibrahim and Chen (2003) ex- 
tend this theoretical justification to time-dependent 
covariates and time- varying regression parameters, 
as well as to grouped survival data. 

3. CHOICE OF THE GP COVARIANCE 
MATRIX 

We explore alternative covariance formulations for 
the Gaussian process prior (2) and demonstrate the 
flexibility of the construction. In general, any plau- 
sible relationship between the covariates and the re- 
sponse can be represented through the choice of C, 
as long as the condition of positive definiteness of the 
matrix is satisfied; see Thrun, Saul and Scholkopf 
(2004). In the Appendix we further address practical 
computational issues that arise in the application of 
Gaussian processes due to numerical instability in 
the construction of the covariance matrix and the 
calculation of its inverse. 

3.1 1-term vs. 2-term Exponential Forms 

We consider covariance functions that include a 
constant term and a nonlinear, exponential term as 

(7) C = Cov(z(X)) = ^J n + ^exp(-G), 

with J n an n x n matrix of l's and exp(G) a matrix 
with elements exp(gij), where gu = (xj — Xj)'P(xj — 
Xj) and P = diag(-log(/9i,...,p p )), with p k 6 [0,1] 
associated to x^, k = 1, . . . ,p. In the literature on 
Gaussian processes a noise component, called "jit- 
ter," is sometimes added to the covariance matrix C, 



in addition to the term (1/A)J, in order to make the 
matrix computations better conditioned; see Neal 
(1999). This is consistent with the belief that there 
may be unexplained sources of variation in the data, 
perhaps due to explanatory variables that were not 
recorded in the original study. The parametrization 
of G we adopt allows simpler prior specifications 
(see below), and it is also used by Linkletter et al. 
(2006) as a transformation of the exponential term 
used by Neal (1999) and Sacks, Schiller and Welch 
(1989) in their formulations. Neal (1999) notices that 
introducing an intercept in model (3), with preci- 
sion parameter A a , placing a Gaussian prior on it 
and then marginalizing over the intercept produces 
the additive covariance structure (7). The parame- 
ter for the exponential term, X z , serves as a scaling 
factor for this term. In our empirical investigations 
we found that construction (7) is sensitive to scaling 
and that best results can be obtained by normalizing 
X to lie in the unit cube, [0, l] p , though standardiz- 
ing the columns to mean and variance 1 produces 
similar results. 

The single-term exponential covariance provides 
a parsimonious representation that enables a broad 
class of linear and nonlinear response surfaces. Plots 
(a)-(c) of Figure 1 show response curves produced 
by utilizing a GP with the exponential covariance 
matrix (7) and three different values of p. One read- 
ily notes how higher order polynomial-type response 
surfaces can be generated by choosing relatively lower 
values for p, whereas the assignment of higher values 
provides lower order polynomial-type that can also 
include roughly linear response surfaces [plot (c)]. 

We also consider a two-term covariance obtained 
by adding a second exponential term to (7), that is, 

C = Cov(z(X)) 

(8) 

= T-Jn + t — exp(-Gi) + - — exp(-G 2 ), 

where Gi and G2 are parameterized as Pi 
= diag(- log(pi,i, • • • , pi, P )) and P 2 = diag(- log(p 2 ,i, 
• • • )P2,p))) respectively. As noted in Neal (2000), ad- 
ding multiple terms results in rougher, more com- 
plex, surfaces while retaining the relative computa- 
tional efficiency of the exponential formulation. For 
example, plot (d) of Figure 1 shows examples of sur- 
faces that can be generated by employing the 2-term 
covariance formulation with (pi, p 2 ) = (0.5, 0.05) and 
(A M = 1,A 2 , Z = 8). 
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(a) 1-term Covariance: p 1 = 0.5 (b) 1-term Covariance: p 1 = 0.05 



0.4 1 1 1 1 1 1 1 1 1 2 




Fig. 1. Response curves drawn from a GP. Each plot shows two (solid and dashed) random realizations. Plots (a)-(c) were 
obtained with the exponential covariance (7) and plot (d) with the 2-term formulation (8). Plots (e) and (f) show realizations 
from the matern construction. All curves employ a one- dimensional covariate. 
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3.2 The Matern Construction 

An alternative choice to the exponential covari- 
ance term is the Matern formulation. This intro- 
duces an explicit smoothing parameter, v, such that 
the resulting Gaussian process is k times differen- 
tial) le for k < v, 

C(z(x i ),z(x i )) 

(9) 

with d(xj,Xj) = (xj — Xj)'P(xi — Xj), K v {-) the Bes- 
sel function and P parameterized as in (7). Banerjee 
et al. (2008) employ such a construction with v fixed 
to 0.5 for modeling a spacial random effects process 
characterized by roughness. One recovers the expo- 
nential covariance term from the Matern construc- 
tion in the limit as v — > oo. However, Rasmussen and 
Williams (2006) point out that two formulations are 
essentially the same for v > |, as confirmed by our 
own simulations. 

4. PRIOR MODEL FOR BAYESIAN VARIABLE 
SELECTION 

The unified modeling approach we have described 
allows us to put forward a general framework for 
variable selection that employs Bayesian methods 
and mixture priors for the selection of the predic- 
tors. In particular, variable selection can be achieved 
within the GP modeling framework by imposing "spi- 
ke-and-slab" mixture priors on the covariance pa- 
rameters in (7), that is, 

(10) 7r(p fc | 7fc ) = 7fe I[0 < p k < 1] + (1 - 7fc)<*i(Pfc), 

for k = l,...,p, with 5\(-) a point mass distribu- 
tion at one. Clearly, p k = 1 causes the predictor xj~ 
to have no effect on the computation for the GP co- 
variance matrix. This formulation is similar in spirit 
to the use of selection priors for linear regression 
models and is employed by Linkletter et al. (2006) 
in the univariate GP regression framework (3). Fur- 
ther Bernoulli priors are imposed on the selection 
parameters, that is, 7^ ~ Bernoulli(afc) and Gamma 
priors are specified on the precision terms (A a , X z ). 

Variable selection with a covariance matrix that 
employs two exponential terms as in (8) is more 
complex. In particular, one can select covariates sep- 
arately for each exponential term by assigning a spe- 
cific set of variable selection parameters to each term, 
that is, (71,72) associated to (p 1 ,p 2 ), and simply 



extending the single term formulation via indepen- 
dent spike-and-slab priors of the form 

7r(pi,fe|7i,fe) 

(11) 

= 7i,fel[0 < pi,k < 1] + (1 " li,k)Si(pi, k ), 

7T(P2,fc|72,fc) 

(12) 

= 72,fel[0 < P 2,k < 1] + (1 - 72,fc)<W2,ifc), 

with k = 1, . . . ,p. Assuming a priori independence 
of the two model spaces, Bernoulli priors can be im- 
posed on the selection parameters, that is, 7^ ~ 
Bernoulli(aj i fc), i = 1, 2. This variable selection frame- 
work identifies the association of each covariate, x k , 
to one or both terms. Final selection can then be ac- 
complished by choosing the covariates in the union 
of those selected by either of the two terms. An 
alternative strategy for variable selection may em- 
ploy a common set of variable selection parameters, 
7 = (71, . . . , 7 p ) for both pi and p 2 , in a joint spike- 
and-slab (product) prior formulation, 

ir(pi 7 k,P2,k\lk) 
(13) =7feI[0<Pi,fc<l]I[0<p 2 ,fc<l] 

+ (1 -7fc)<*l(Pl,fc)<*l(P2,ft)) 

where we assume a priori independence of the pa- 
rameter spaces, pi and p 2 - This prior choice focuses 
more on overall covariate selection, rather than si- 
multaneous selection and assignment to each term 
in (8). While we lose the ability to align the p{ ^ 
to each covariance function term, we expect to im- 
prove computational efficiency by jointly sampling 
(7,p 1; p 2 ) a t each iteration of the MCMC scheme 
as compared to a separate joint sampling on (■y 1 , pi) 
and (72)^2)- Some investigation is done in Savitsky 
(2010). 

5. POSTERIOR INFERENCE 

The methods for posterior inference we are go- 
ing to describe apply to all GP formulations, even 
though we focus our simulation work on the con- 
tinuous and count data models. We therefore ex- 
press the posterior formulation employing a gen- 
eralized notation. First, we collect all parameters 
of the GP covariance matrix in and write C = 
C(0). For example, for covariance matrix of type 
(7) we have = (p, A a ,A z ). Next, we extend our 
notation to include the selection parameter 7 by us- 
ing 7 = (p 7 ,A a ,A 2 ) to indicate that p^ = 1 when 
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7fc = 0, for k = 1, . . . ,p. For covariance of type (8) we 
write 7 = {0 7 , © 72 , A a }, where 7 = (71,72)' and 
7i = (Pi 7i ,A i)je ),i G {1,2} for prior of type (11)- 
(12) and 7 = (p ±J , p 2j , X a , \i, z , h,z) for prior of 
type (13), and similarly for the Matern construc- 
tion. Next, we define Di € {yi, {sj, z(xj)}} and D := 
{D\, . . . ,D n } to capture the observed data augmen- 
ted by the unobserved GP variate, z(X), for the la- 
tent response models [such as model (5) for count 
data]. Finally, we set h := {r, r} to group unique 
parameters ^ 7 and we collect hyperparameters 
in m := {a, b}, with a = {a\ a ,a\ z ,a r ,a T } and sim- 
ilarly for b, where a and b include the shape and 
rate hyperparameters of the Gamma priors on the 
associated parameters. With this notation we can 
finally outline a generalized expression for the full 
conditional of (7, p 7 ) as 

^(7,p 7 l©7\p-Y> D > h ' m ) 

(14) 

oc L a (-y, p 7 |0 7 \p 7 , D, h, m)vr(7), 

with L a the augmented likelihood. Notice that the 
term 7r(p 7 |7) does not appear in (14) since ir(pk\ 
Ik) = 1, for k = l,...,p. 

5.1 Markov Chain Monte Carlo — Scheme 1 

We first describe a Metropolis-Hastings scheme 
within Gibbs sampling to jointly sample (7,p 7 ), 
which is an adaptation of the MCMC model compar- 
ison (MC 3 ) algorithm originally outlined in Madigan 
and York (1995) and extensively used in the variable 
selection literature. As we are unable to marginal- 
ize over the parameter space, we need to modify the 
algorithm in a hierarchical fashion, using the move 
types outlined below. Additionally, we need to sam- 
ple all the other nuisance parameters. 

A generic iteration of this MCMC procedure com- 
prises the following steps: 

(1) Update (7, p 7 ): Randomly choose among three 
between-models transition moves: 

(i) Add: set 7^ = 1 and sample p' k from a U(0, 1) 
proposal. Position k is randomly chosen from the set 
of fc's where 7^. = at the previous iteration. 

(ii) Delete: set (7^ = 0,p' k = 1). This results in 
covariate being excluded in the current iteration. 
Position k is randomly chosen from among those 
included in the model at the previous iteration. 

(iii) Swap: perform both an Add and Delete move. 
This move type helps to more quickly traverse a large 
covariate space. 



The proposed value (~f',p',) is accepted with prob- 
ability, 

f 7r(y , p 7 ,|0y \pL, D, h, m)q(j\j') > 
a = mm< 1, — ; — — - — — >, 

I n7,p 7 l© 7 \p7> D > h > m M7 It) J 

where the ratio of the proposals q(p 1 )/q(p ' ,) drops 
out of the computation since we employ a U(0, 1) 
proposal. 

(2) Execute a Gibbs-type move, Keep, by sam- 
pling from a U(0, 1) all p' k 's such that 7^ = 1. This 
move is not required for ergodicity, but it allows to 
perform a refinement of the parameter space within 
the existing model, for faster convergence. 

(3) Update {X a , X z }: These are updated using Me- 
tropolis-Hastings moves with Gamma proposals cen- 
tered on the previously sampled values. 

(4) Update h: Individual model parameters in h 
are updated using Metropolis-Hastings moves with 
proposals centered on the previously sampled values. 

(5) Update z: Jointly sample z for latent response 
models using the approach enumerated in Neal (1999) 
with proposal z' = (1 — e 2 ) 1 / 2 z + eLu, where u is 
a vector of i.i.d. standard Gaussian values and L 
is the Cholesky decomposition of the GP covarian- 
ce matrix. For faster convergence R consecutive up- 
dates are performed at each iteration. 

Green (1995) introduced a Markov chain Monte 
Carlo method for Bayesian model determination for 
the situation where the dimensionality of the param- 
eter vector varies iteration by iteration. Recently, 
Gottardo and Raftery (2008) have shown that the 
reversible jump can be formulated in terms of a mix- 
ture of singular distributions. Following the results 
given in their examples, it is possible to show that 
the acceptance probability of the reversible jump for- 
mulation is the same as in the Metropolis-Hastings 
algorithm described above, and therefore that the 
two algorithms are equivalent; see Savitsky (2010). 

For inference, estimates of the marginal poste- 
rior probabilities of 7/0 = 1, for k = 1 . . . ,p, can be 
computed based on the MCMC output. A simple 
strategy is to compute Monte Carlo estimates by 
counting the number of appearances of each covari- 
ate across the visited models. Alternatively, Rao- 
Blackwellized estimates can be calculated by aver- 
aging the full conditional probabilities of 7^ = 1. Al- 
though computationally more expensive, the latter 
strategy may result in estimates with better pre- 
cision, as noted by Guan and Stephens (2011). In 
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all simulations and examples reported below we ob- 
tained satisfactory results by estimating the marginal 
posterior probabilities by counts restricted to be- 
tween-models moves, to avoid over-estimation. 

5.2 Markov Chain Monte Carlo — Scheme 2 

Next we enumerate a Markov chain Monte Carlo 
algorithm to directly sample (7, p 7 ) with a Gibbs 
scan that employs a Metropolis acceptance step. We 
formulate a proposal distribution of a similar mix- 
ture form as the joint posterior by extending a re- 
sult from Gottardo and Raftery (2008) to produce 
a move to (7^ = 0, pt = 1), as well as to (7^ = l,pk = 
[0,1))- 

A generic iteration of this MCMC procedure com- 
prises the following steps: 

(1) For k = l,...,p perform a joint update for 
ilk,Pk) with two moves, conducted in succession: 

(i) Between-models: Jointly propose a new model 
such that if 7^ = 1, propose j' k = and set p' k = 
1; otherwise, propose 7^ = 1 and draw p' k ~ 14(0, 1). 
Accept the proposal for {"i' k -,p' k ) with probability, 



a 



min< 1, 



, D,h,m) 



(*0 



©V ,D,h, m) 

T(fe)' ' 



where now 7^ := (7J, . . .^ k _ v j k +i, ■ ■ ■ ,lp) and si- 
milarly for p'^ G ®y n , • The joint proposal ratio for 



(lk,Pk), reduces to 1 since we employ a U(0, 1) pro- 
posal for pk € [0, 1] and a symmetric Dirac measure 
proposal for 7^. 

(ii) Within model: This move is performed only if 
we sample j' k = 1 from the between-models move, in 
which case we propose 7^ = 1 and, as before, draw 
p" k ~ U(0, 1). Similar to the between-models move, 
accept the joint proposal for ('y'^p'D with probabil- 
ity, 



Q 



mm 



r ^,^IV (fc) ,ey (fc) ,D,h, 

1 ,7r (7fc,Pfcl7( fc) ,©y (fe) ,D,h, 



111 : 



m) 



which further reduces to just the ratio of posteriors 
since we propose a move within the current model 
and utilize a U(0, 1) proposal for p^. 

(2) Sample the parameters {A a ,A 2 ,h} and latent 
responses z as outlined in scheme 1. 

In simulations we also investigate performances 
of an adaptive scheme that employs a proposal with 
tuning parameters adapted based on "learning" from 



the data. In particular, we employ the method of 
Haario, Salesman and Tamminen (2001) for our Ber- 
noulli proposal for j\a to successively update the 
mean parameter, at,k = 1, . . . ,p, based on prior sam- 
pled values for 7^. The construction does not re- 
quire additional likelihood computations and it is 
expected to achieve more rapid convergence in the 
model space than the nonadaptive scheme. Roberts 
and Rosenthal (2007) and Ji and Schmidler (2009) 
note conditions under which adaptive schemes achie- 
ve convergence to the target posterior distribution. 

Schemes 1 and 2 we enumerated above may be 
easily modified when employing the 2-term covari- 
ance formulation (8); see Savitsky (2010). 

5.3 Prediction 

Let zj = z(X.f) be an ray x 1 latent vector of future 
cases. We use the regression model (3) to demon- 
strate prediction under the GP framework. The joint 
distribution over training and test sets is defined to 



be z* := [z'y f ]'~A/"(0,C 



J n-\-rif ■- 



l+Uf, 



->(X,X) 



with covariance, 



where C(x,x) := C(x,x)(®)- The conditional joint 
predictive distribution over the test cases, z^|z, is 
also multivariate normal distribution with expecta- 
tion E[z^|z] = C( X/ ,x)C^ -jgZ. Estimation is based 
on the posterior MCMC samples. Here we take a com 
putationally simple approach by first estimating z as 
the mean of all sampled values of z, defining 

(15) D(0):=C (X/)X) C^ x) z, 
and then estimating the response value as 



(16) 



1 K 



(t) ' 



t=l 



with K the number of MCMC iterations and where 
calculations of the covariance matrices in (15) are re- 
stricted to the variables selected based on the margi- 
nal posterior probabilities of 7^ = 1. A more coher- 
ent estimation procedure, that may return more pre- 
cise estimates but that is also computationally more 
expensive, would compute Rao-Blackwellized esti- 
mates by averaging the predictive probabilities over 
all visited models; see Guan and Stephens (2011). 
In the simulations and examples reported below we 
have calculated (16) using every 10th MCMC sam- 
pled value, to provide a relatively less correlated 
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sample and save on computational time. In addi- 
tion, when computing the variance product term in 
(15), we have employed the Cholesky decomposition 
C = LL', following Neal (1999), to avoid direct com- 
putation of the inverse of C(x,x)- 

For categorical data models, we may predict the 
new class labels, t/, via the rule of largest proba- 
bility in the case of a binary logit model, with esti- 
mated latent realizations Zf, and via data augmen- 
tation based on the values of y/ in the case of a bi- 
nary probit model. 

5.3.1 Survival Function Estimation For survival 
data it is of interest to estimate the survivor function 
for a new subject with unknown event time, Tj, and 
associated Zfj := Zf 7 i(xf t i). This is defined as 

P(Ti > t\zf ti ,z) = Si(t\zf i,z) 

(17) 

= S (t\z) c ^ z ^ . 

When using the partial likelihood formulation an 
empirical Bayes estimate of the baseline survivor 
function, So(t\z), must be calculated, since the model 
does not specifically enumerate the baseline haz- 
ard. Weng and Wong (2007), for example, propose 
a method that discretizes the likelihood to produce 
an estimator with the useful property that it cannot 
take negative values. Accuracy of this estimate may 
be potentially improved by Rao-Blackwellizing the 
computation by averaging over the MCMC runs. 

6. SIMULATION STUDY 

6.1 Parameter Settings 

In all simulations and applications reported in this 
paper we set both priors on A a and A 2 as Q(l,l). 
We did not observe any strong sensitivity to this 
choice. In particular, we considered different choices 
of the two parameters of these Gammma priors in 
the range (0.01, 1), keeping the prior mean at 1 but 
with progressively larger variances, and observed ve- 
ry little change in the range of posterior sampled 
values. We also experimented with prior mean val- 
ues of 10 and 100, which produced only a small 
impact on the posterior. For model (3) we set r ~ 
Q{a r ,b r ) with (a r ,b r ) = (2,0.1) to reflect our a pri- 
ori expected residual variance. For the count model 
(5), we set r ~ Q(l, 1). For survival data, when using 
the full likelihood from Kalbfleisch (1978) we spec- 
ified a Q(l,l) prior for both the parameter of the 
exponential base distribution and the concentration 



parameter of the Gamma process prior on the base- 
line. 

Some sensitivity on the Bernoulli priors on the 
7fc's is, of course, to be expected, since these priors 
drive the sparsity of the model. Generally speaking, 
parsimonious models can be selected by specifying 
7^ ~ Bernoulli(afc) with a^ = a and a a small per- 
centage of the total number of variables. In our sim- 
ulations we set afc to 0.025. We observed little sen- 
sitivity in the results for small changes around this 
value, in the range of 0.01-0.05, though we would 
expect to see significant sensitivity for much higher 
values of a. We also investigated sensitivity to a Beta 
hyper prior on a; see below. 

When running the MCMC algorithms indepen- 
dent chain samplers with U(0, 1) proposals for the 
pk's have worked well in all applications reported 
in this paper, where we have always approximately 
achieved the target acceptance rate of 40-60% indi- 
cating efficient posterior sampling. 

6.2 Use of Variable Selection Parameters 

We first demonstrate the advantage of introducing 
selection parameters in the model. Figure 2 shows 
results with and without the inclusion of the vari- 
able selection parameter vector 7 on a simulated 
scenario with a kernel that incorporates both linear 
and nonlinear associations. The observed continuous 
response, y, is constructed from a mix of linear and 
nonliner relationships to 4 variables, each generated 
from a U(0, 1), 

y = xi + x 2 + sin(3x 3 ) + sin(5x 4 ) + e, 

with e ~ 7V(0, a 2 ) and a = 0.05. Additional variables 
are randomly generated, again from U(0, 1). In this 
simulation we used (n,p) = (80,20). We ran 70,000 
MCMC iterations, of which 10,000 were discarded 
as burn-in. 

Plot (a) of Figure 2 displays box plots of the MCMC 
samples for the p' k s,k = 1, . . . , 20, for the case of no 
variable selection, that is, by using a simple "slab" 
prior on the Pfc's. As both Linkletter et al. (2006) 
and Neal (2000) note, the single covariates demon- 
strate an association to the response whose strength 
may be assessed utilizing the distance of the pos- 
terior samples of the p^s from 1. One notes that, 
according to this criterion, the true covariates are 
all selected. It is conceivable, however, for some of 
the unrelated covariates to be selected using the 
same criterion, since the pk's all sample below 1, and 
that this problem would be compounded as p grows. 
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(a) Posterior Samples of p 



(b) Posterior Samples of p k 




1 2345678910111213141516171819 20 

Predictor 



1 234567891011 1213141516171819 20 

Predictor 



Fig. 2. Use of variable selection parameters: Simulated data (n = 80, p = 20). Box plots of posterior samples for pk £ [0, 1]. 
Plots (a) and (b) demonstrate selection without and with, respectively, the inclusion of the selection parameter 7. 



Plot (b) of Figure 2, instead, captures results from 
employing the variable selection parameters 7 and 
shows how the inclusion of these parameters results 
in the sampled values of the p^s for variables unre- 
lated to the response being all pushed up against 1. 

This simple simulated scenario also helps us to il- 
lustrate a couple of other features. First, a single ex- 
ponential term in (7) is able to capture a wide vari- 
ety of continuous response surfaces, allowing a great 
flexibility in the shape of the response surface, with 
the linear fit being a subset of one of many types 
of surfaces that can be generated. Second, the effect 
of covariates with higher-order polynomial-like as- 
sociation to the response is captured by having esti- 
mates of the corresponding p^s further away from 1; 
see, for example, covariate X4 in Figure 2 which ex- 
presses the highest order association to the response. 

6.3 Large p 

Next we show simulation results on continuous, 
count and survival data models, for (n,p) = (100, 
1,000). We employ an additive term as the kernel 
for all models, 

y = aiXi + a 2 x 2 + 0,3X3 + ^4^4 

(18) 

+ a 5 sin(a 6 X5) + a 7 sin(a 8 x 6 ) + s. 

The functional form for the simulation kernel is de- 
signed so that the first four covariates express a lin- 
ear relationship to the response while the next two 
express nonlinear associations. Model-specific coef- 
ficient values are displayed in Table 1. Methods em- 
ployed to randomly generate the observed count and 



event time data from the latent response kernel are 
also outlined in the table. For example, the kernel 
captures the log-mean of the Poisson distribution 
used to generate count data, and it is used to gener- 
ate the survivor function that is inverted to provide 
event time data for the Cox model. As in the pre- 
vious simulation, all covariates are generated from 
W(0,1). 

We set the hyperparameters as described in Sec- 
tion 6.1. We used MCMC scheme 1 and increased 
the number of total iterations, with respect to the 
simpler simulation with only p = 20, to 800,000 it- 
erations, discarding half of them for burn-in. 

Results are reported in Table 1. While the con- 
tinuous and count data GP models readily assigned 
high marginal posterior probabilities to the correct 
covariates (figures not shown), the Cox GP model 
correctly identified only 5 of 6 predictors; see Fig- 
ure 3 for the posterior distributions of 7^ = 1 and 
the box plots for the posterior samples of p^ for this 
model (for readability, only the first 20 covariates 
are displayed). The predictive power for the con- 
tinuous and count data models was assessed by nor- 
malizing the mean squared prediction error (MSPE) 
with the variance of the test set. Excellent results 
were achieved in our simulations. For the Cox GP 
model, the averaged survivor function estimated on 
the test set is shown in Figure 4, where we ob- 
serve a tight fit between the estimated curve and the 
Kaplan-Meier empirical estimate constructed from 
the same test data. 

Though for the Cox model we only report results 
obtained using the partial likelihood formulation, 
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Table 1 

Large p: Simulations for continuous, count and survival data models with (n,p) = (100, 1,000) 



Continuous data Count data 



Cox model 



Coefficients: 
at 

«2 
03 

as 

(Hi 
07 

as 

Model 



Train/test 

Correctly selected 
False positives 
MSPE (normalized) 



1.0 
1.0 
1.0 
1.0 
1.0 
3.0 
1.0 
5.0 

Identity link 



100/20 

6 out of 6 


0.0067 



1.6 3.0 

1.6 -2.5 

1.6 3.5 

1.6 -3.0 

1.0 1.0 

3.0 3.0 

1.0 -1.0 

5.0 5.0 
log(A) = y S(t\y) = exp[-H (t) exp(y)] 
f~Pois(A) H (t) = At, A = 0.2 

t = M/(A exp(y)), M ~ Exp(l) 
5% uniform randomly censored, 

teens — £^(0, tevent ) 

100/20 100/60 

6 out of 6 5 out of 6 


0.045 see Figure 4 



Selected Predictors y 




Posterior Samples of p 



2 4 6 8 10 12 14 16 18 20 

Variable Selection Parameters: y ...y 




1 2345678910111213 14 1516171819 20 

Predictor 



Fig. 3. Cox GP model with large p: Simulated data (n = 100, p = 1,000). Posterior distributions for = 1 and box plots of 
posterior samples for pk ■ 



we conducted the same simulation study with the 
model based on the full likelihood of Kalbfleisch 
(1978). The partial likelihood model formulation pro- 
duced more consistent results across multiple chains, 
with the same data, and was able to detect much 
weaker signals. The Kalbfleisch (1978) model did, 
however, produce lower posterior values near for 
nonselected covariates, unlike the partial likelihood 
formulation, which shows values typically from 10- 



40%, pointing to a potential bias toward false posi- 
tives. 

Additional simulations, including larger sample si- 
zes cases, are reported in Savitsky (2010). 

6.4 Comparison of MCMC Methods 

We compare the 2 MCMC schemes previously de- 
scribed for posterior inference on (-y, p) on the basis 
of sampling and computational efficiency. We use 
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Table 2 

Efficiency comparison of GP MCMC methods 





MCMC scheme 2 


MCMC scheme 1 


Adaptive 


Nonadaptive 




Iterations (computation) 


5,000 


5,000 


500,000 


Autocorrelation time 








Pe 


310 


82 


441 


ps 


59 


35 


121 


Computation 








CPU-time (sec) 


980 


4,956 


10,224 





0.9 




0.8 




0.7 








0.6 






ro 




-Q 

O 


0.5 


CL 




> 


0.4 










« 


0.3 




0.2 




0.1 








-8 -6 -4 -2 2 4 6 

log Survival Time 

Fig. 4. Cox GP model with large p: Simulated data 
(n — 100, p — 1,000). Average survivor function curve on the 
validation set (dashed line) compared to the Kaplan-Meier 
empirical estimate (solid line). 

the univariate regression simulation kernel 

y = x\ + 0.8x2 + 1-3x3 + sin(x4) + sin(3x5) 

+ sin(5x 6 ) + (1.5x 7 )(1.5x 8 ) + e, 

with e ~ Af(0, a 2 ) and a = 0.05. We utilize 1,000 co- 
variates with all but the first 8 defined as nuisance. 
We use a training and a validation set of 100 obser- 
vations each. 

The two schemes differ in the way they update 
(~f,p)- While scheme 1 samples either one or two po- 
sitions in the model space on each iteration, scheme 2 
samples (jkiPk) for each of the p covariates. Be- 
cause of this a good "rule-of-thumb" should em- 
ploy a number of iterations for scheme 1 which is 
roughly p times the number of iterations employed 
for scheme 2. The use of the Keep move in scheme 1, 
however, reduces the need of scaling the number of 
iterations by exactly p, since all p^s are sampled 



at each iteration. In our simulations we found sta- 
ble convergence under moderate correlation among 
covariates for scheme 2 in 5,000 iterations and for 
scheme 1 in 500,000 iterations. For both schemes, we 
discarded half of the iterations as burn-in. The CPU 
run times we report in Table 2 are based on utiliza- 
tion of Matlab with a 2.4 GHz Quad Core (Q6600) 
PC with 4 GB of RAM running 64-bit Windows XP. 

We compared sampling efficiency looking at auto- 
correlation for selected pk- The autocorrelation time 
is defined as one plus twice the sum of the autocorre- 
lations at all lags and serves as a measure of the rel- 
ative dependence for MCMC samples. We used the 
number of MCMC iterations divided by this factor 
as an "effective sample size." We followed a proce- 
dure outlined by Neal (2000) and ran first scheme 2 
for 1,000 iterations, to obtain a state near the pos- 
terior distribution. We then employed this state to 
initiate a chain for each of the two schemes. We 
ran scheme 2 for an additional 2,000 iterations and 
scheme 1 for 200,000 (using the last 2,000 draws 
for each of the target pk for final comparison). For 
scheme 2 we used both the adaptive and nonadap- 
tive versions. Table 2 reports results for aligned 
to a covariate expressing a linear interaction, and 
for pq, for a highly nonlinear interaction. We observe 
that both versions of scheme 2 express notable im- 
provements in computational efficiency as compared 
to scheme 1. We note, however, that the adaptive 
scheme method produces draws of higher autocor- 
relation than the nonadaptive method. 

6.5 Sensitivity Analysis 

We begin with a sensitivity analysis on the prior for 
Pk\lk = 1- Table 3 shows results under a full factorial 
combination for hyperparameters (a, b) of a Beta prior 
construction, where we recall Beta(l,l) =U(0, 1). 
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(a) Posterior Samples of p 



(b) Posterior Samples of p 



1 

0.9 
0.8 
0.7 
0.6 
0.5 
0.4 
0.3 
0.2 
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A 
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Fig. 5. Prior Sensitivity for pk.\jk = 1 ~ Beta(a, b) : Box plots of posterior samples for pu for (a, b) = (0.5, 0.5) — plot ( a) — and 
(a, 6) = (2.0, 2.0)— plot (h). 



Table 3 

Prior sensitivity for pk |7fc = 1 ~ Beta(a, b) . Results are 
reported as (number of false negatives) / (normalized MSPE) 



b\a 


0.5 


1.0 


2.0 


0.5 


2/0.18 


2/0.15 


2/0.18 


1.0 


1/0.14 


1/0.16 


2/0.18 


2.0 


1/0.15 


2/0.16 


2/0.17 



Results were obtained with the univariate regression 
simulation kernel 

y = x\ + X2 + sin(1.5x3) sin(1.5x4) + sin(3x5) 

+ sin(3x 6 ) + (1.5x 7 )(1.5x 8 ) + e, 

with e ~A/"(0, a 2 ) and where we employed a higher 
error variance of a = 0.28. As before, we employ 
1,000 covariates with all but the first 8 defined as 
nuisance. A training sample of 110 was simulated, 
along with a test set of 100 observations. We em- 
ployed the adaptive scheme 2, with 5,000 iterations, 
half discarded as burn-in. 

Figure 5 shows box plots of posterior samples for 
Pk for two symmetric alternatives, 1 : (a, b) = (0.5, 0.5) 
(U-shaped) and 2 : (a, b) = (2.0, 2.0) (symmetric uni- 
modal). For scenario 2 we observe a reduction in 
posterior jitter on nuisance covariates and a stabi- 
lization of posterior sampling for associated covari- 
ates, but also a greater tendency to exclude xs,x^. 
One would expect the differences in posterior sam- 
pling behavior across prior hyperparameter values to 
decline as the sample size increases. Table 3 displays 
the number of nonselected true variables (false neg- 
atives), out of 8, along with the normalized MSPEs 



for all scenarios. There were no false positives to re- 
port across all hyperparameter settings. Overall, re- 
sults are similar across the chosen settings for (a, 6), 
with slightly better performances for a < 1 and b > 
1, corresponding to strictly decreasing shapes that 
aid selection by pushing more mass away from 1 , in- 
creasing the prior probability of the good variables 
to be selected, especially in the presence of a large 
number of noisy variables. 

Next we imposed a Beta distribution on the hy- 
perparameter a of the priors 7^ ~ Bernoulli(o;) for 
covariate inclusion. We follow Brown, Vannucci and 
Fearn (1998a) to specify a vague prior by setting the 
mean of the Beta prior to 0.025, reflecting a prior 
expectation for model sparsity, and the sum of the 
two parameters of the distribution to 2. We ran 
the same univariate regression simulation kernel as 
above with the hyperparameter settings for the Beta 
prior on pk equal to (1,1) and obtained the same se- 
lection results as in the case of a fixed and a slightly 
lower normalized MSPE of 0.14. 

Last, we explored performances with respect to 
correlation among the predictors. We utilized the 
same kernel as above with 8 true predictors from 
which to construct the response. We then induced 
a 70% correlation among 20 randomly chosen nui- 
sance covariates and the true predictor xq . We found 
2 false negatives and 1 false positive, which demon- 
strates a relative selection robustness under corre- 
lation. We did observe a significant decline in nor- 
malized MSPE, however, to 0.33, as compared to 
previous runs. 
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Table 4 
Ozone data: Results 



Prior on g 






RMSPE 


Local empirical Bayes 


X5, X6, X7, X 6 , X 7 , X3X5 


6 


4.5 


Hyper-g (a = 4) 


X5, Xe, Xr, Xq, X7 , X3X5 


6 


4.5 


Fixed (BIC) 


X5 , Xe , Xr , Xq , X7 , X3 X 5 


6 


4.5 


Brown, Vannucci and Fearn (2002) 


X\ Xe, , Xi X 7 , Xe Xr , X\ , Xf , X7 


6 


4.5 


CP model 


X3 , Xe , X7 


3 


3.7 



7. BENCHMARK DATA APPLICATIONS 

We now present results on two data sets often 
used in the literature as benchmarks. For both anal- 
yses we performed inference by using the MCMC — 
scheme 2, with 5,000 iterations and half discarded 
as burn-in. 

7.1 Ozone data 

We start by revisiting the ozone data, first an- 
alyzed for variable selection by Breiman and Fried- 
man (1985) and more recently by Liang et al. (2008). 
This data set supplies integer counts for the maxi- 
mum number of ozone particles per one million par- 
ticles of air near Los Angeles for n = 330 days and 
includes an associated set of 8 meteorological pre- 
dictors. We held out a randomly chosen set of 165 
observations for validation. 

Liang et al. (2008) use a linear regression model 
including all linear and quadratic terms for a to- 
tal of p = 44 covariates. They achieve variable se- 
lection by imposing a mixture prior on the vector 
(5 of regression coefficients and specifying a g-prior 
of the type /3 7 |0~A^(O, |(X^X 7 )^ 1 ). Their results 
are reported in Table 4 with various formulations for 
g. In particular, the local empirical Bayes method of- 
fers a model-dependent maximizer of the marginal 
likelihood on g, while the hyper-g formulation with 
a = 4 is one member of a continuous set of hyper- 
prior distributions on the shrinkage factor, g/{l + 
g) ~ Beta(l,a/2 — 1). Since the design matrix ex- 
presses a high condition number, a situation that 
can at times induce poor results with g-priors, we 
additionally applied the method of Brown, Vannucci 
and Fearn (2002) who used a mixture prior of the 
type f3 1 ~JV(0, cl). Results shown in Table 4 were 
obtained from the Matlab code made available by 
the authors. 

Though previous variable selection work on the 
ozone data all choose a Gaussian likelihood, a more 



precise approach employs a discrete Poisson or neg- 
ative binomial formulation on data with low count 
values, or a log-normal approximation where counts 
are high. With a maximum value of 38 and a mean 
of 11 we chose to model the data with the negative- 
binomial count data model (5). We used the same 
hyperparameter settings as in our simulation study. 
Results are shown in Figure 6. By selecting, for ex- 
ample, the best 3 variables, we achieve a notable 
decrease in the root-MSPE as compared to the lin- 
ear models. Also, by allowing an a priori unspeci- 
fied functional form for how covariates relate to the 
response, we end up selecting a much more parsi- 
monious model, although, of course, we lose in in- 
terpretability of the selected terms, with respect to 
linear formulations that specifically include linear, 
quadratic and interactions terms in the model. 

7.2 Boston Housing data 

Next we utilize the Boston Housing data set, also 
analyzed by Breiman and Friedman (1985), who used 
an additive model and employed an algorithm to 
empirically determine the functional relationship for 
each predictor. This data set relates p = 13 predic- 
tors to the median value of owner-occupied homes in 
each of n = 506 census tracts in the Boston metro- 
politan area. As with the previous data set, we held 
out a random set of 250 observations to assess pre- 
diction. 

We employed the continuous data model (3) with 
the same hyperparameter settings as in our simu- 
lations. The four predictors chosen by Breiman and 
Friedman (1985), (xq, ^lCb in, £13), had all marginal 
posterior probability of inclusion greater than 0.9 in 
our model. Other variables with high marginal pos- 
terior probability were (X5, xj, xs, £12)- The adapt- 
ability of the GP response surface is illustrated with 
closer examination of covariate X5, which measures 
the level of nitrogen oxide (NOX) , a pollutant emit- 
ted by cars and factories. At low levels, indicating 
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Fig. 7. Boston housing data: Posterior distributions for 7/t = 1 ana! boa; plots of posterior samples for p^. 



proximity to jobs, x§ presents a positive association 
to the response, and at high levels, indicating overly 
industrialized areas, a negative association. This in- 
verted parabolic association over the covariate range 
probably drove its exclusion in the model of Breiman 
and Friedman (1985). The GP formulation is, how- 
ever, able to capture this strong nonlinear relation- 
ship as is noted in Figure 7. By using only the subset 
of the best eight predictors, we achieved a normal- 
ized MSE of 0.1 and a prediction R 2 of 0.9, very 
close to the value of 0.89 reported by Breiman and 
Friedman (1985) on the training data. 

We also employed the Matern covariance construc- 
tion (9), which we recall employs an explicit smooth- 
ing parameter, v € [0, oo). While selection results 



were roughly similar, the prediction results for the 
Matern model were significantly worse than the ex- 
ponential model, with a normalized MSPE of 0.16, 
probably due to overfitting. It is worth noticing that 
the more complex form for the Bessel function in- 
creases the CPU computation time by a factor of 
5-10 under the Matern covariance as compared to 
the exponential construction. 

For comparison, we looked at GBMs. We used ver- 
sion 3.1 of the gbm package for the R software envi- 
ronment. We utilized the same training and valida- 
tion data as above. After experimentation and use 
of 10-fold cross-validation, we chose a small value 
for the input regularization parameter, v = 0.0005, 
to provide a smoother fit that prevents overfitting. 
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Fig. 8. Boston housing data: GBM covariate analysis. Left-hand chart provides variables importance, normalized to sum up 
to 100. Right-hand plot enumerates partial association 0/213 to the response. 



Larger values of v resulted in higher prediction er- 
rors. The GBM was run for 50,000 iterations to 
achieve minimum fit error. The result provided a nor- 
malized MSPE of 0.13 on the test set, similar to, 
though slightly higher than, the GP result. The left- 
hand chart of Figure 8 displays the relative covariate 
importance. Higher values correspond to {x\%,x§, xg), 
and agree with our GP results. A number of other 
covariates show similar importance values to one an- 
other, though lower than these top 3, making it un- 
clear as to whether they are truly related or nui- 
sance covariates. Similar conclusions are reported by 
other authors. For example, Tokdar, Zhu and Ghosh 
(2010) analyze a subset of the same data set with 
a Bayesian density regression model based on lo- 
gistic Gaussian processes and subspace projections 
and found (x%3,Xq) as the most influential predic- 
tors, with a number of others having a mild influ- 
ence as well. The right-hand plot supplies a partial 
dependence plot obtained by the GBM for variable 
X13 by averaging over the associations for the other 
covariates. We note that the nonlinear association is 
not constrained to be smooth under GBM. 

8. DISCUSSION 

In this paper we have presented a unified mod- 
eling approach via Gaussian processes that extends 
to data from the exponential dispersion family and 
to survival data. Such model formulation allows for 



nonlinear associations of the predictors to the re- 
sponse. We have considered, in particular, contin- 
uous, categorical and count responses and survival 
data. Next we have addressed the important prob- 
lem of selecting variables from a set of possible pre- 
dictors and have put forward a general framework 
that employs Bayesian variable selection methods 
and mixture priors for the selection of the predic- 
tors. We have investigated strategies for posterior 
inference and have demonstrated performances on 
simulated and benchmark data. GP models provide 
a parsimonious approach to model formulation with 
a great degree of freedom for the data to define the 
fit. Our results, in particular, have shown that GP 
models can achieve good prediction performances 
without the requirement of prespecifying higher or- 
der and nonlinear additive functions of the predic- 
tors. The benchmark data applications have shown 
that a GP formulation may be appropriate in cases 
of heterogeneous covariates, where the inability to 
employ an obvious transformation would require 
higher order polynomial terms in an additive lin- 
ear fashion, or even in the case of a homogeneous 
covariate space where the transformation overly re- 
duces structure in the data. Our simulation results 
have further highlighted the ability of the GP for- 
mulation to manage data sets with p> n. 

A challenge in the use of variable selection meth- 
ods in the GP framework is to manage the numeri- 
cal instability in the construction of the GP covari- 
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ance matrix. In the Appendix we describe a projec- 
tion method to reduce the effective dimension of this 
matrix. Another practical limitation of the models 
we have described is the difficulty to use them with 
qualitative predictors. Qian, Wu and Wu (2008) pro- 
vide a modification of the GP covariance kernel that 
allows for nominal qualitative predictors consisting 
of any number of levels. In particular, the authors 
model the covariance structure under a mixture of 
qualitative and quantitative predictors by employ- 
ing a multiplicative factor against the usual GP ker- 
nel for each qualitative predictor to capture the by- 
level categorical effects. 

Some generalization of the methods we have pre- 
sented are possible. For example, as with GLM mod- 
els, we may employ an additional set of variance in- 
flation parameters in a similar construction to Neal 
(1999) and others to allow for heavier tailed distri- 
butions while maintaining the conjugate framework. 

APPENDIX: COMPUTATIONAL ASPECTS 

We focus on the exponential form (7) and intro- 
duce an efficient computational algorithm to gener- 
ate C. We also review a method of Banerjee et al. 
(2008) to approximate the inverse matrix that em- 
ploys a random subset of observations and provide 
a pseudo-code. 

A.l Generating the Covariance Matrix C 

Let us begin with the quadratic expression, G = 
{g itj } in (7). We rewrite g itj = A'^[- log(p)] with 
Aij constructed as a p x 1 vector of term-by-term 
squared differences, (xik — Xjk) 2 ,k = 1, . . . ,p. We may 
directly employ the p x 1 vector, p, as P is diago- 
nal. As a first step, we may then directly compute 
G = A[— log(p)], where Aisnxnxp. We are, how- 
ever, able to reduce the more complex structure of A 
to a two dimensional matrix form by simply stacking 
each {i,j} row of dimension 1 x p under each other 
such that our revised structure, A*, is of dimension 
n 2 x p and the computation, G = A*[— log(p)], re- 
duces to a series of inner products. Next, we note 
that log(/9fc) = for p^ = 1. So we may reduce the 
dimension for each of the n 2 inner products by re- 
ducing the dimension of p to the p 7 < p nontriv- 
ial covariates. We may further improve efficiency 
by recognizing that since our resultant covariance 
matrix, C, is symmetric positive definite, we need 
only compute the inner products for a reduced set 
of unique terms (by removing redundant rows from 



A*) and then "re- inflate" the result to a vector of 
the correct length. Finally, we exponentiate this vec- 
tor, multiply the nonlinear weight (l/A^), add the 
affine intercept term, (1/A a ), and then reshape this 
vector into the resulting n x n matrix, C. The re- 
sulting improvement in computational efficiency at 
n = 100 from the naive approach that employs dou- 
ble loops of inner products is on the order of 500 
times. 

Our MCMC scheme 2 proposes a change to pk £ p, 
one-at-a-time, conditionally on p_ k and the other 
sampled parameters. Changing a single pk requires 
updating only one column of the inner product com- 
putation of A* and [— log(p)]. Rather than conduct- 
ing an entire recomputation for C, we multiply the 
kth column of A* (with number of rows reduced 
to only unique terms in C) by log(^jf ), where 
"prop" means the proposed value for pj~. This re- 
sult is next exponentiated (to a covariance kernel), 
re- inflated and shaped into an n x n matrix, A. 
We then take the current value less the affine term, 
Cold — X^J«j an d multiply by A, term-by-term, and 
add back the affine term to achieve the new covari- 
ance matrix associated to the proposed value for pk- 
So we may devise an algorithm to update an exist- 
ing covariance matrix, C, rather than conducting an 
entire recomputation. At p = 1,000 with 6 nontriv- 
ial covariates and n = 100, this algorithm further 
reduces the computation time over recomputing the 
full covariance by a factor of 2. This efficiency grows 
nonlinearly with the number of nontrivial covariates. 

A. 2 Projection Method for Large n 

In order to ease the computations, we have also 
adapted a dimension reduction method proposed by 
Banerjee et al. (2008) for spatial data. The method 
achieves a reduced-dimension computation of the in- 
verse of the full {n x n) covariance matrix. It can 
also help with the accuracy and stability of the pos- 
terior computations when working with possibly ill- 
conditioned GP covariance matrices, particularly for 
large n. To begin, randomly choose m < n points 
(knots), sampled within fixed intervals on a grid to 
ensure relatively uniform coverage, and label these 
m points z*. Then define z m _j, n as the orthogonal 
projection of z onto the lower dimensional space 
spanned by z*, computed as the conditional expec- 
tation 

z m^n = E(z|z*) = C( z * )Z )C^ z ,^Z*. 
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We use the univariate regression framework in (3) to 
illustrate the dimension reduction from constructing 
the projection model using z m — yn in place of z(x). 
Recast the model from (3) to 

y = z m ^ n + e = Cf^C^l^z* + e, 

where ~ J\f(0,^). Then derive A n = Cov(y) = 
fin + C ( z %z) C (z*,z*) C (z*,z)- Fm aUy, employ the 
Woodbury matrix identity to transform the inverse 
computation, A" 1 = rl — r 2 Cj z „ Z JC( Z * >Z *) + 

r C( z * >z ) • C', z » Z ^]~ 1 C( Z * Z ), where the quantity inside 
the square brackets, now being inverted, is m x m, 
supplying the dimension reduction for inverse com- 
putation we seek. We note that, in the absence of 
the projection method, a large jitter term would be 
required to invert the GP covariance matrix, trad- 
ing accuracy for stability. Though the projection 
method approximates a higher dimensional covari- 
ance matrix in a lower dimensional projection, we 
yet improve performance and avoid the accuracy/sta- 
bility trade-off. We do, however, expect to use more 
iterations for MCMC convergence when employing 
a relatively lower projection ratio. 

All results shown in this paper were obtained with 
m/n = 0.35, for simulated data, and with m/n = 
0.25, for the benchmark applications, where we en- 
hanced computation stability in the presence of the 
high condition number for the design matrix. We 
have also employed the Cholesky decomposition, in 
a similar fashion as in Neal (1999), in lieu of directly 
computing the resulting m x m inverse. 

A. 3 Pseudo-code 

Procedure to Compute, C = j-J n + t- exp(— G): 
Input: data matrices; 

(Xi,X2) of dimension (n\,n2) xp 
Output: function, [-A*,if u u] = difference(Xi,X2) 
% A* is matrix of squared L2 distances 

for 2 data matrices of p columns 
% A* size, i x p, £< n\U2'. only unique entries 
% /fun re-inflates A* with duplicate entries 
% Key point: Compute A*, once, 

and re-use in GP posterior computations 
% Set counter to stack all obs 

from Xi,X2 in vectorized construction 
count = 1; 
% Compute squared distances 
FOR i = 1 to m 
FOR j = 1 to n 2 

(count,:) = (x 1}i -x 2 ,j) 2 ; 



count = count + 1; 

END 
END 

% Reduce A* {ull to A* 

[A*, /fun] = unique (A^, by row); 
END FUNCTION 

Input: Data = (A*, /f uU ), & = (p,X a ,X z ) 
Output: function, [C] = C(A*, I m , 0) 
% An m x ri2 GP covariance matrix 
% Only compute inner product 
for column k where pt < 1 

selp = {p k < l}; 

p = p{sel p ) 

A* = A*(:,sel p ); 
% Compute vector of unique values for C 

-G vcc = A*[log(p)Y; 

Cvec — J - exp ( G v cc ) j 

% Re-inflate C vcc to include duplicate values 

Cvec — Cvec ( /full )i 

% Snap Cvec into matrix form, C 

C = reshape(C vec , ri2, ni)'; 
END FUNCTION 

Input: Previous covariance = C id; 

Data = (A*, /fun); Position changed = k, 
Parameters = (pk,new, Pfc,dd), Intercept = A a 

Output: [C now ] = CpartialCCoid, A*,/ fu ll, k, X a ) 

% Compose new covariance matrix, C ne w> 

from old, C Q id 
% Compute inner products only for row k of A* 
% Produce matrix of multiplicative differences 

from old to new 

-AG vcc = A*(:,fc)xlog(^); 
% Re-inflate exp(— AG vec ) 

exp(-AG V cc) = exp(-AG vcc )(/fuii); 
% Re-shape — AG VCC to matrix, A 

A = reshape(exp[— AG vec ],n2,ni)'; 
% Compute C new 

C ncw = ^ Jn + (C id — A^"*J«) O 

END FUNCTION 

Procedure to Compute Inverse of A n = jl n + C: 
Input: Number of sub-sample = m, Data = X, 
Error precision = r 

Covariance parameters = = (p, A a , A z ) 
Output: A" 1 

% Randomly select m<n observations 
on which to project n x 1, z(x) 

ind = random, permutations, latin, hypercube(ra); 
% space-filling 
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X m = X(md(l:m) 1 :); 

% Compute squared distances, A^,A* 

[^m)^m,full] = difference(X m ,X m ); % m x n 
[A*, Ifuji] = difference (X m ,X); % n x n 

% Compose associated covariance matrices 

C(m,m) = C(Am, I m ,iull, ©) ; 
C(m,n) = C (^4* ) -^full, ©); 

% Compute A n 

A n = -In + C( m ,„)C( m , m )C(m,n); 

% Compute A" 1 employing 
term-by-term multiplication 

An 1 = rl n - r 2 C'( m ^ [C( m>m ) 

END 
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